AITopics | permutation hashing

Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Neural Information Processing Systems, Dec-25-2025, 19:31:26 GMT

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named "re-randomization" in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the so-called "bin-wise consistent weighted sampling (BCWS)" algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Extensive experiments on various datasets illustrate the effectiveness of our proposed methods.
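To make the quantities in this abstract concrete, the following is a minimal sketch, not code from the paper. It computes the Jaccard similarity of two binary vectors, the weighted (min/max) Jaccard similarity that consistent weighted sampling is generally understood to target, and a classical minwise-hashing estimate that uses k independent permutations. The function names and the default k are assumptions chosen for illustration.

```python
import numpy as np

def jaccard(x, y):
    """Jaccard similarity of two binary (0/1) vectors: |x AND y| / |x OR y|."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    return float(np.logical_and(x, y).sum() / union) if union else 0.0

def weighted_jaccard(x, y):
    """Weighted Jaccard for non-negative vectors: sum(min) / sum(max)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    return float(np.minimum(x, y).sum() / denom) if denom else 0.0

def minhash_estimate(x, y, k=500, seed=0):
    """Classical MinHash: k independent permutations, one hash value each.

    The fraction of permutations on which the two minima agree
    estimates the Jaccard similarity.
    """
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    ix, iy = np.flatnonzero(x), np.flatnonzero(y)
    if ix.size == 0 or iy.size == 0:
        raise ValueError("both vectors need at least one nonzero entry")
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(k):
        perm = rng.permutation(x.size)      # a fresh permutation per hash
        hits += int(perm[ix].min() == perm[iy].min())
    return hits / k
```

The loop over k permutations is the cost that one permutation hashing is designed to avoid.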

bin-wise consistent weighted sampling, permutation hashing, re-randomized densification, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining (0.89)
Information Technology > Artificial Intelligence > Machine Learning (0.83)


Reviews: Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Neural Information Processing Systems, Jun-1-2025, 06:03:51 GMT

The authors propose that the optimal densification for OPH can actually be optimized further. In standard OPH, we apply one permutation to the sparse vector and break the permuted vector into K equal-sized bins. In the usual Consistent Weighted Sampling (CWS) approach, we sample non-empty bins from these K bins and retrieve a fixed hash code for these bins. In the new approach, the authors suggest treating each of the K bins as a separate sparse vector and performing MinHash on the retrieved bin to obtain a hash code, instead of taking the bin's existing hash code directly. The authors prove theoretically that this re-randomization achieves the smallest variance among densification schemes (the schemes used to retrieve hash codes for empty buckets). They also extend this idea to weighted non-negative sparse vectors through a method called Bin-wise CWS. The paper seems to be a subtle improvement over prior work.
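The idea summarized in this review can be sketched roughly as follows. This is an illustrative reading of the description above, not the authors' algorithm: the function name `oph_rerand`, the way the donor bin is chosen (uniform draws from a generator seeded by the empty bin's index), and the encoding of hash values are all assumptions. The sketch also assumes K divides the dimension D.

```python
import numpy as np

def oph_rerand(x, K, seed=0):
    """OPH sketch of binary vector x with re-randomized filling of empty bins."""
    x = np.asarray(x, dtype=bool)
    D = x.size
    if D % K != 0:
        raise ValueError("this sketch assumes K divides D")
    m = D // K                                    # length of each bin
    perm = np.random.default_rng(seed).permutation(D)   # the one permutation
    # Group the permuted positions of the nonzeros by bin (offsets in 0..m-1).
    bins = [[] for _ in range(K)]
    for p in perm[np.flatnonzero(x)]:
        bins[p // m].append(p % m)
    if not any(bins):
        raise ValueError("x has no nonzero entries")
    h = np.empty(K, dtype=np.int64)
    for j in range(K):
        if bins[j]:
            # Non-empty bin: ordinary OPH value (smallest permuted position).
            h[j] = j * m + min(bins[j])
            continue
        # Empty bin: randomness depends only on (seed, j), so every vector
        # hashed with the same seed sees the same draws (consistency).
        r = np.random.default_rng([seed, j])
        sigma = r.permutation(m)                  # fresh within-bin permutation
        while True:
            b = int(r.integers(K))                # candidate donor bin
            if bins[b]:
                break
        # Re-randomize: MinHash of the donor bin under sigma, not a copy
        # of the donor's existing hash value.
        h[j] = b * m + int(sigma[bins[b]].min())
    return h

def collision_estimate(hx, hy):
    """Fraction of the K hash values on which two sketches agree."""
    return float(np.mean(hx == hy))
```

Encoding each value as `b * m + offset` keeps values drawn from different bins distinct, so two vectors can only collide in a slot when they used the same bin and the same minimum.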

bin-wise consistent weighted sampling, consistent weighted sampling, re-randomized densification, (4 more...)

Neural Information Processing Systems

Genre: Summary/Review (0.41)

Technology:

Information Technology > Data Science > Data Mining (0.81)
Information Technology > Artificial Intelligence > Machine Learning (0.64)


Reviews: Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Neural Information Processing Systems, Jan-26-2025, 02:27:34 GMT

Overall, the reviewers appreciated the idea, which, although incremental, was quite subtle and interesting. Given also the importance of the problem, this was enough for all reviewers to agree to push this paper over the bar.

bin-wise consistent weighted sampling, permutation hashing, re-randomized densification, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining (0.40)
Information Technology > Artificial Intelligence > Machine Learning (0.40)


Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Neural Information Processing Systems, Oct-10-2024, 15:24:39 GMT

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named "re-randomization" in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the so-called "bin-wise consistent weighted sampling (BCWS)" algorithm.

bin-wise consistent weighted sampling, permutation hashing, re-randomized densification, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)


One Permutation Hashing

Neural Information Processing Systems, Feb-16-2024, 07:51:25 GMT

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k = 500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry.
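As a rough illustration of replacing k permutations with one, the sketch below permutes the data once, splits the permuted positions into K equal bins, and keeps the smallest nonzero position in each bin. The `EMPTY` marker, the function names, and the form of the estimator (matches divided by the number of bins that are not empty in both vectors) are assumptions of this sketch rather than details taken from the abstract, and K is assumed to divide the dimension D.

```python
import numpy as np

EMPTY = -1  # marker for a bin that received no nonzero entry

def oph_sketch(x, K, perm):
    """One permutation hashing: one pass over the nonzeros, K bin minima."""
    x = np.asarray(x, dtype=bool)
    if x.size % K != 0:
        raise ValueError("this sketch assumes K divides D")
    m = x.size // K                          # bin length
    h = np.full(K, EMPTY, dtype=np.int64)
    for p in perm[np.flatnonzero(x)]:        # permuted position of each nonzero
        j, off = p // m, p % m               # bin index and offset inside it
        if h[j] == EMPTY or off < h[j]:
            h[j] = off
    return h

def oph_estimate(hx, hy):
    """Matches among the bins that are non-empty in at least one vector."""
    both_empty = (hx == EMPTY) & (hy == EMPTY)
    matches = (hx == hy) & ~both_empty
    usable = hx.size - int(both_empty.sum())
    return float(matches.sum() / usable) if usable else 0.0
```

For sparse data many bins stay empty, which is the situation that densification schemes are meant to handle.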

minwise, permutation hashing

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)


C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations

Li, Xiaoyun, Li, Ping

arXiv.org Machine Learning, Nov-18-2021

Minwise hashing (MinHash) is a classical method for efficiently estimating the Jaccard similarity in massive binary (0/1) data. To generate $K$ hash values for each data vector, the standard theory of MinHash requires $K$ independent permutations. Interestingly, the recent work on "circulant MinHash" (C-MinHash) has shown that merely two permutations are needed. The first permutation breaks the structure of the data and the second permutation is re-used $K$ times in a circulant manner. Surprisingly, the estimation variance of C-MinHash is proved to be strictly smaller than that of the original MinHash. The more recent work further demonstrates that practically only one permutation is needed. Note that C-MinHash is different from the well-known work on "One Permutation Hashing (OPH)" published in NIPS'12. OPH and its variants using different "densification" schemes are popular alternatives to the standard MinHash. The densification step is necessary in order to deal with empty bins which exist in One Permutation Hashing. In this paper, we propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing. Basically, we develop a new densification method for OPH, which achieves the smallest estimation variance compared to all existing densification schemes for OPH. Our proposed method is named C-OPH (Circulant OPH). After the initial permutation (which breaks the existing structure of the data), C-OPH only needs a "shorter" permutation of length $D/K$ (instead of $D$), where $D$ is the original data dimension and $K$ is the total number of bins in OPH. This short permutation is re-used in $K$ bins in a circulant shifting manner. It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
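The two-permutation idea attributed to C-MinHash in this abstract can be sketched as below. This illustrates only the circulant re-use of a second permutation, not C-OPH itself. The function name, the shift direction, and the index convention are assumptions made for the example.

```python
import numpy as np

def c_minhash(x, K, sigma, pi):
    """K hash values from two permutations of length D.

    sigma is applied once to break any structure in the data; pi is then
    re-used K times, shifted circularly by k positions for the k-th hash.
    """
    x = np.asarray(x, dtype=bool)
    D = x.size
    if K > D:
        raise ValueError("this sketch assumes K <= D")
    pos = sigma[np.flatnonzero(x)]           # nonzero positions after sigma
    if pos.size == 0:
        raise ValueError("x has no nonzero entries")
    # k-th hash: minimum of pi, shifted by k, over the nonzero positions.
    return np.array([pi[(pos - k) % D].min() for k in range(1, K + 1)])

def collision_fraction(hx, hy):
    """Fraction of hash values on which two sketches agree."""
    return float(np.mean(hx == hy))
```

Both vectors must be hashed with the same `sigma` and `pi` for the collision fraction to be meaningful as a similarity estimate.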

c-minhash, c-oph, permutation, (13 more...)

arXiv.org Machine Learning

2111.09544

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(17 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.93)


Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Li, Ping, Li, Xiaoyun, Zhang, Cun-Hui

Neural Information Processing Systems, Mar-19-2020, 03:17:07 GMT

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named "re-randomization" in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the so-called "bin-wise consistent weighted sampling (BCWS)" algorithm.

bin-wise consistent weighted sampling, permutation hashing, re-randomized densification, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)


One Permutation Hashing

Li, Ping, Owen, Art, Zhang, Cun-hui

Neural Information Processing Systems, Feb-15-2020, 00:27:11 GMT

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) $k = 500$ permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing $k$ permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry.

minwise, permutation hashing

Neural Information Processing Systems

Genre: Research Report (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.52)

